Principal Activities and Findings:


Spectrum  of  Random  Kernel  Matrices 

We derive the limiting spectral density of random matrices whose (i, j)-th entry is f(X_i^T X_j),
where X_1, ..., X_n are i.i.d. standard Gaussian random vectors in R^p, and f is a real-valued
function. The eigenvalue distribution of these kernel random matrices is studied in the high
dimensional / large sample regime ("large p, large n"). Our analysis applies as long as the
rescaled kernel function is generic; in particular, this includes non-smooth functions such as the
Heaviside step function. Interestingly, the limiting densities interpolate between the
Marcenko-Pastur density and the Wigner semi-circle density.
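
As a rough numerical illustration of the objects described above, the following sketch builds one kernel random matrix and computes its eigenvalues. The 1/sqrt(p) rescaling of the inner products, the 1/sqrt(n) normalization, the zeroed diagonal and the function name are assumptions made for this example, not details stated in this report.

```python
import numpy as np

def kernel_matrix_eigenvalues(n, p, f, seed=0):
    """Eigenvalues of the n x n matrix with off-diagonal entries f(X_i^T X_j / sqrt(p))."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))      # rows are i.i.d. standard Gaussian vectors in R^p
    G = X @ X.T / np.sqrt(p)             # rescaled inner products (assumed scaling)
    A = f(G) / np.sqrt(n)                # kernel applied entrywise (assumed normalization)
    np.fill_diagonal(A, 0.0)             # diagonal set to zero (assumed convention)
    return np.linalg.eigvalsh(A)         # symmetric matrix, so eigenvalues are real

# Non-smooth kernel: the Heaviside step function, with value 0.5 at the origin.
eigs = kernel_matrix_eigenvalues(n=400, p=200, f=lambda g: np.heaviside(g, 0.5))
```

A histogram of `eigs` for growing n and p (with the ratio p/n held fixed) gives an empirical picture of the spectral density for a chosen kernel.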

Documentation:  J22 


Robust  Principal  Component  Analysis 

We proved that for data generated from an elliptical distribution, the empirical spectral distribution of
Tyler’s M-estimator for the covariance matrix converges to a Marcenko-Pastur-type distribution.
Elliptical distributions play an important role in portfolio theory, radar, and financial data, and are
typically used whenever the empirical distributions are heavy-tailed due to outliers.

Documentation:  J12 


Principal  Component  Analysis  from  Noisy  Projected  Data 

The sample covariance is the most popular way to estimate the covariance matrix of a dataset.
However, in many situations the sample covariance cannot be formed directly from the
measurements, for example when there is missing data or when the measurements are linear
projections of the underlying signals. While it is possible to estimate the low rank structure
through the matrix completion/sensing framework, solutions of the latter are obtained using
either a semidefinite program (nuclear norm minimization), which is slow in practice, or alternating
minimization, which lacks theoretical guarantees. We show that the low rank structure can be
estimated via the solution of a linear system that is formed using tools from high dimensional PCA
and suitable eigenvalue shrinkage. We applied this new methodology to the denoising of
extremely noisy cryo-electron microscopy images and to reveal three-dimensional structural
variability in such datasets.

Documentation: J11, J19, C31


Compressive  Sensing  -  Random  Demodulator: 

The  sampling  rate  of  analog-to-digital  converters  is  severely  limited  by  underlying  technological 
constraints. Recently, Tropp et al. proposed a new architecture, called a random demodulator (RD),
that attempts to overcome this limitation by sampling sparse, bandlimited signals at a rate much
lower than the Nyquist rate. An integral part of this architecture is a random bi-polar modulating
waveform  (MW)  that  changes  polarity  at  the  Nyquist  rate  of  the  input  signal.  Technological 
constraints  also  limit  how  fast  such  a  waveform  can  change  polarity,  so  we  propose  an 
extension  of  the  random  demodulator  that  uses  a  run-length  limited  (RLL)  modulating  waveform, 
and  which  we  call  a  constrained  random  demodulator  (CRD).  The  RLL  modulating  waveform 
changes  polarity  at  a  slower  rate.  We  establish  that  a  CRD  enjoys  theoretical  guarantees  similar 
to  the  RD  and  that  these  guarantees  are  directly  related  to  the  power  spectrum  of  the  MW. 
Further,  we  show  that  the  relationship  between  the  placement  of  energy  in  the  spectrum  of  the 
input  signal  and  the  placement  of  energy  in  the  power  spectrum  of  the  MW  has  a  major  effect 
on  the  reconstruction  performance  of  signals  sampled  by  a  CRD. 
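
The effect of a run-length constraint on the modulating waveform can be sketched numerically. The example below is illustrative only: the uniform choice of run lengths, the particular limits (2 to 8 chips) and the function names are assumptions, and the report's actual RLL construction may differ.

```python
import numpy as np

def rll_waveform(length, min_run, max_run, seed=0):
    """Random +/-1 sequence whose runs of equal sign have lengths in [min_run, max_run]."""
    rng = np.random.default_rng(seed)
    chips = []
    sign = int(rng.choice([-1, 1]))
    while len(chips) < length:
        run = int(rng.integers(min_run, max_run + 1))  # run length drawn uniformly (assumption)
        chips.extend([sign] * run)
        sign = -sign                                   # polarity flips only between runs
    return np.array(chips[:length])

def high_band_fraction(w):
    """Fraction of waveform energy in the upper half of the one-sided spectrum."""
    P = np.abs(np.fft.rfft(w)) ** 2
    return P[len(P) // 2:].sum() / P.sum()

rng = np.random.default_rng(1)
rd = rng.choice([-1, 1], size=4096)          # unconstrained waveform, may flip at every chip
crd = rll_waveform(4096, min_run=2, max_run=8)

rd_frac, crd_frac = high_band_fraction(rd), high_band_fraction(crd)
```

Comparing `rd_frac` with `crd_frac` shows how the constraint moves energy in the waveform's power spectrum toward low frequencies, which is the quantity the guarantees above are tied to.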

Documentation: J1, J8, C11, C12, C16


Compressive Sensing - Information Theoretic Limits

We approach the problem of how to design optimal measurements through the Singular Value
Decomposition or SVD. The SVD is the product of three matrices and each plays a role in the
design of optimal linear measurements. The function of the right singular vectors in the SVD of the
measurement matrix is to collect energy from the source, so they should coincide with the
eigenvectors of the source covariance. We arrange these eigenvectors in decreasing order of
the corresponding singular values, starting with the biggest and going down. The function of the
left singular vectors in the SVD of the measurement matrix is to align high energy source modes
with low noise modes, so they should coincide with the eigenvectors of the noise covariance.

We arrange these eigenvectors in increasing order of the corresponding singular values,
starting with the smallest and going up. Finally, the function of the singular values of the
measurement matrix is to distribute the available energy among the channel modes. Note that
we have ordered the singular values so that the ratios of the ith noise singular value
to the ith source singular value are increasing in i.
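
The three roles described above can be sketched as a construction of a measurement matrix from its SVD factors. This is a minimal sketch under stated assumptions: the name `design_measurement`, the symbol `Phi`, and in particular the equal split of energy across modes are placeholders; the optimal singular values would come from an energy allocation that is not specified here.

```python
import numpy as np

def design_measurement(source_cov, noise_cov, energy):
    """Phi = U diag(s) V^T with V from the source covariance and U from the noise covariance."""
    m = noise_cov.shape[0]
    _, src_vecs = np.linalg.eigh(source_cov)   # eigh returns eigenvalues in ascending order
    V = src_vecs[:, ::-1][:, :m]               # top-m source modes, decreasing order
    _, U = np.linalg.eigh(noise_cov)           # noise modes, increasing order
    s = np.full(m, np.sqrt(energy / m))        # equal energy per mode (placeholder allocation)
    return U @ np.diag(s) @ V.T

rng = np.random.default_rng(0)
n, m = 6, 3
Bx = rng.standard_normal((n, n))
source_cov = Bx @ Bx.T                         # hypothetical source covariance
Bn = rng.standard_normal((m, m))
noise_cov = Bn @ Bn.T + np.eye(m)              # hypothetical noise covariance
Phi = design_measurement(source_cov, noise_cov, energy=1.0)
```

With this alignment the strongest source mode is routed to the quietest noise mode, the second strongest to the second quietest, and so on.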

We design measurement matrices to maximize the mutual information I(x; y), because we have in
mind conditional mean estimation as the means of recovering the signal of interest from the noisy
projection. The minimum mean squared error (MMSE) is the trace of the MMSE matrix. The lower bound
on the MMSE is minimized when the mutual information I(x; y) is maximized, and the
lower bound is met with equality when x is Gaussian. Our work takes advantage of a
relationship between the gradient of mutual information and the MMSE matrix that was
discovered by Guo, Shamai and Verdu in 2005.
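
The scalar form of this relationship is easy to check numerically. The sketch below assumes the simplest setting, a scalar Gaussian channel y = sqrt(snr) x + n with x and n standard Gaussian, where both sides are known in closed form; the matrix version used in measurement design is not reproduced here.

```python
import numpy as np

def mutual_information(snr):
    """I(x; y) in nats for y = sqrt(snr) * x + n with x, n ~ N(0, 1)."""
    return 0.5 * np.log1p(snr)

def mmse(snr):
    """Minimum mean squared error of estimating x from y in the same model."""
    return 1.0 / (1.0 + snr)

snr, h = 2.0, 1e-6
# Central finite difference approximating dI/dsnr.
derivative = (mutual_information(snr + h) - mutual_information(snr - h)) / (2 * h)
gap = abs(derivative - 0.5 * mmse(snr))   # dI/dsnr should equal mmse / 2
```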

The  work  of  Verdu  and  collaborators  is  motivated  by  communications,  where  the  aim  is  to 
maximize  mutual  information  between  input  signal  and  received  signal.  In  communications  we 
know  the  statistics  of  the  source  x,  that  is  to  say  we  know  the  correlation  matrix  Dx  and  we  can 
calculate  its  singular  value  decomposition.  If  we  know  the  channel,  and  hence  its  SVD,  then  we 
can  align  the  source  so  that  it  is  minimally  attenuated  by  the  channel.  That  is  the  function  of  the 
precoder  designed  by  Palomar  and  Verdu  that  is  inserted  between  the  transmitter  and  the 
channel.  In  sensing  we  know  the  SVD  of  the  source  and  we  are  simply  trying  to  design  the  SVD 
of  the  measurement  matrix. 

We have developed a generalization of Bregman divergence to unify vector Poisson and
Gaussian channels. We are interested in vector Poisson channels because they are a good
model for X-ray scatter and also for document classification. In document classification we assume
L classes of documents, each characterized by a vector of probabilities over n words. The
Poisson model describes how the words are drawn for a document in a given class. The
number of words is large, so we count the number of times words in subsets of the dictionary
appear. These subsets act like key words associated with a given topic, and the resulting counts are our
compressive measurements. We have applied our theory to compressive topic modeling for the
analysis of document corpora, and have improved upon the state of the art for the 20 Newsgroups
corpus.
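
The measurement model for documents can be sketched as follows. All sizes, the Dirichlet draw for the word probabilities and the random choice of dictionary subsets are hypothetical values chosen for illustration; designed subsets would replace the random ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_measurements, doc_length = 1000, 20, 500

word_probs = rng.dirichlet(np.ones(n_words))     # hypothetical word distribution for one class
counts = rng.poisson(doc_length * word_probs)    # Poisson word counts for one document

# Each row of A marks one subset of the dictionary (binary, chosen at random here).
A = (rng.random((n_measurements, n_words)) < 0.05).astype(int)
y = A @ counts                                   # word counts aggregated over each subset
```

Each entry of `y` is the number of times any word from the corresponding subset appears, so a document is summarized by 20 counts instead of 1000.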

Documentation:  J5,  C3,  C10,  C15,  C17,  C18,  C20,  C22,  C26 

Compressive Sensing - Subspace Models

Many  important  types  of  signal,  including  speech,  faces,  digits  and  fingerprints,  can  be 
accurately  modeled  as  low-dimensional  subspaces  in  a  larger  ambient  space.  Hence  the 
problem  of  using  a  limited  number  of  linear  measurements  to  discriminate  subspaces  excited  by 
Gaussian  noise  is  fundamental  to  modern  detection  and  estimation. 

We  are  able  to  determine  to  within  a  single  measurement  the  minimum  number  of 
measurements  required  to  successfully  reconstruct  a  signal  drawn  from  a  Gaussian  mixture 
distribution  in  the  low  noise  regime.  Our  method  is  to  develop  upper  and  lower  bounds  that  are 
a  function  of  the  maximum  dimension  of  the  linear  subspaces  spanned  by  the  Gaussian  mixture 
components. We show that an n-dimensional signal that is s-sparse with non-zero components
drawn independently and identically distributed from a Gaussian mixture distribution can be
reconstructed perfectly in the low-noise regime with exactly s+1 measurements. This estimate is
tighter and sharper than standard bounds on the minimum number of measurements needed to
recover sparse signals associated with a union of subspaces model. It shows that it is possible
to achieve the performance of intractable ℓ0-pseudonorm recovery algorithms using the optimal
closed-form conditional mean estimator within the Bayesian compressive sensing paradigm.

We  derive  these  results  by  developing  a  first-order  low-noise  expansion  of  the  MMSE  that 
captures  the  existence  or  absence  of  an  MMSE  floor  as  well  as  the  rate  of  decay  to  this  floor. 
The presence or absence of an MMSE floor depends only on the relation between the number
of measurements ℓ and the rank s of the source covariance. The exact value of the MMSE floor
(when ℓ is less than s) and the MMSE power offset (when ℓ is at least s) depend on the relation
between the geometry of the measurement kernel and the geometry of the source. This
geometric relation is captured by a multivariate generalization of the MMSE dimension
(introduced by Wu and Verdu in 2011) that distinguishes MMSE expansions associated with
different measurement kernels and source covariances. We are then able to use this geometric
framework to quantify the advantage of measurement kernels that are designed over those that
are random. While kernel design does not impact the phase transition, we are able to show
that designed kernels can improve reconstruction performance both in terms of a lower error
floor (if present) and a lower power offset. We have also connected theory to the practice of
image reconstruction using a 20-class Gaussian mixture model for non-overlapping 8x8 image
patches  derived  from  100,000  patches  randomly  sampled  from  500  images  in  the  Berkeley 
Segmentation  Dataset.  The  phase  transition  phenomenon  is  clearly  visible  in  our  reconstruction 
of  the  image  Barbara  (which  was  not  of  course  included  in  the  original  training  set). 
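
The MMSE floor and its disappearance can be seen in the simplest case of a single Gaussian source with a rank-s covariance, where the conditional mean estimator and its error are available in closed form. The sketch below assumes that single-Gaussian setting with random measurement kernels; the dimensions, noise level and function name are illustrative choices, and for mixtures the threshold reported above is s+1 rather than s.

```python
import numpy as np

def gaussian_mmse(B, Phi, noise_var):
    """MMSE of estimating x = B z, z ~ N(0, I), from y = Phi x + w, w ~ N(0, noise_var * I)."""
    G = Phi @ B                                    # effective kernel acting on z
    M = np.eye(B.shape[1]) + G.T @ G / noise_var   # posterior precision of z
    post_cov_z = np.linalg.inv(M)                  # posterior covariance of z
    return np.trace(B @ post_cov_z @ B.T)          # error mapped back to x

rng = np.random.default_rng(0)
n, s = 16, 4
B = rng.standard_normal((n, s))                    # source covariance B B^T has rank s
floors = {}
for m in (2, 4, 8):                                # number of measurements (m is our notation)
    Phi = rng.standard_normal((m, n)) / np.sqrt(n) # random measurement kernel
    floors[m] = gaussian_mmse(B, Phi, noise_var=1e-8)
```

With fewer measurements than the rank, the error stays bounded away from zero as the noise vanishes; once the number of measurements is comfortably above the rank, the error scales with the noise variance.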

Documentation:  J2,  J4,  J6,  J7,  J10,  C2,  C4,  C5,  C6,  C8,  C9,  C22,  C25 
Deep Learning

Deep  neural  networks  have  proved  very  successful  in  domains  where  large  training  sets  are 
available,  but  when  the  number  of  training  samples  is  small,  their  performance  suffers  from 
overfitting. Prior methods of reducing overfitting, such as weight decay, Dropout and
DropConnect, are data-independent. Our work is also motivated by the problem of overfitting, but
the framework for learning features that are robust to data variation is different, and we are able
to explicitly trade off the discriminative value of learned features against the generalization error
of  the  learning  algorithm.  Our  theoretical  analysis  starts  with  a  cover  of  the  data  space,  which  is 
a partition into subsets with the property that the distance between pairs in the same subset is
bounded by γ. We achieve robustness by encouraging the transform that maps data to features
to be a local isometry, so that distances can increase by at most δ. All that remains is to relate
loss to distance, and we are able to achieve (K, 2A(γ + δ))-robustness, where A is the Lipschitz
constant of the loss function.

Documentation:  C19,  C30,  C32 

Compressive Classification

We  have  shown  that  fundamental  limits  on  classification  cannot  be  avoided  in  a  world  where 
there  is  mismatch  between  a  class  and  the  subspace  used  to  model  that  class.  Our  method  is  to 
connect  the  problem  of  using  a  limited  number  of  linear  measurements  to  discriminate 
subspaces,  to  that  of  using  multiple  transmit  antennas  to  communicate  over  a  non-coherent 
wireless  channel.  This  connection,  between  two  very  different  fields,  means  that  capacity  results 
obtained  by  Zheng  and  Tse  for  wireless  communication  can  be  used  to  derive  fundamental 
limits  on  compressive  classification.  When  a  classifier  tries  to  identify  k-dimensional  subspaces 
from  an  M-dimensional  projection,  corrupted  by  noise/ mismatch  of  variance  we  have  shown 
that  classification  fails  with  high  probability  when  there  are  more  than  (1/n)'^  '^  subspaces  to 
discern.  When  k  is  at  least  M/2  the  converse  holds  true;  classification  succeeds  with  high 
probability  when  there  are  fewer  than  (1/n)^  '^  subspaces  to  discern. 

Rate-distortion  theory  is  the  branch  of  information  theory  that  deals  with  the  lossy  compression 
of  random  sources.  Shannon’s  famous  rate-distortion  theorem  relates  the  encoding  rate  R  and 
the  expected  distortion  according  to  the  mutual  information  between  the  source  and  its  estimate. 
Ahmad  proposed  to  use  rate-distortion  analysis  to  bound  learning  performance,  by  treating  the 
posterior  distribution  as  a  soft  version  of  the  MAP  classifier.  The  posterior  distribution  is  a 
random  object,  and  it  takes  the  role  of  the  source,  which  we  want  to  represent  up  to  some 
distortion.  The  training  samples  take  the  role  of  the  finite  rate  encoding  of  the  posterior.  The 
higher  the  number  of  samples  the  more  information  is  conveyed  about  the  posterior.  The 
distortion measure is the average ℓ1 distance between the posterior and the estimate produced
by the learning machine, and a classic result is that the generalization error is bounded above
by the ℓ1 loss. We have used the machinery of rate-distortion theory to derive bounds on the
tradeoff between classifier performance and the size of the training set. These bounds involve a
quantity called the Interpolation Dimension that captures the inherent complexity of the posterior.
Interpolation  dimension  plays  a  role  similar  to  the  VC  dimension  in  the  classical  theory,  but 
provides  bounds  that  are  much  tighter,  particularly  when  the  number  of  training  samples  is 
small. 
Wireless Communication

We  have  developed  protocols  that  are  able  to  take  advantage  of  stale  channel  feedback.  We 
have  shown  that  if  channel  statistics  are  known,  then  it  is  possible  to  anticipate  the  statistics  of 
collisions,  and  to  transmit  linear  combinations  of  inputs  that  can  be  resolved  at  the  receivers. 

Documentation:  C21,  C28,  C31 


Data  Storage 

Use  of  Flash  memory  is  increasing  because  capacity  is  increasing,  and  the  cost  differential 
between  Flash  and  other  storage  technologies  (especially  hard  drives)  is  narrowing.  NAND 
Flash  dominates  solid-state  drives  (SSDs)  and  typical  storage  devices  use  multi-level  cells  with 
2  (SLC),  4  (MLC)  or  8  (TLC)  levels  per  cell.  MLCs  are  usually  preferred  because  they  are  more 
mature  than  TLCs  and  provide  better  storage  density  than  SLCs.  One  drawback  to  using  Flash 
is  that  we  can  only  erase  a  Flash  cell  a  given  number  of  times  before  that  cell  can  no  longer 
retain  information.  The  number  of  Program/Erase  (P/E)  cycles  that  a  cell  can  tolerate  depends 
on  the  type  of  the  cell  used  (SLC,  MLC  or  TLC),  and  the  scale  of  the  Flash  technology.  Another 
practical  difficulty  is  that  the  4  physical  levels  per  MLC  cell  are  accessible  only  as  two  virtual  2- 
level  cells  on  separate  pages.  We  have  developed  a  method  of  creating  virtual  Flash  cells  with 
several  logical  levels  that  avoids  the  need  to  change  current  hardware.  We  have  demonstrated 
how to implement waterfall coding on the new virtual cells, and have introduced a new
pseudo-erase operation that further extends memory lifetime. Our work connects the current Flash
interface  with  the  promise  of  coding  techniques  developed  by  the  information  theory  and  coding 
community. 